feat: add Paraformer, Zipformer, Punctuation, GPU acceleration (v0.3.5)#42

Open
zhuzhuyule wants to merge 12 commits into cjpais:main from zhuzhuyule:main

@zhuzhuyule zhuzhuyule commented Feb 27, 2026

Updated 2026-03-30: Added punct model I/O fixes (int32 input, f32 argmax output), GPU accel module, log noise reduction. See latest comment for details.

Summary

Add sherpa-onnx speech recognition engines, neural punctuation restoration, GPU acceleration, and upgrade core to v0.3.5.

New Engines

| Feature Flag | Component | Type |
|---|---|---|
| paraformer | ParaformerModel | Single ONNX, non-autoregressive |
| zipformer-ctc | ZipformerCtcModel | Single ONNX, CTC greedy decode |
| zipformer-transducer | ZipformerTransducerModel | 3 ONNX (encoder/decoder/joiner), RNN-T greedy search |
| punct | PunctModel | CT-Transformer punctuation restoration (zh+en) |
| (core) | accel module | GPU device enumeration and accelerator selection |
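
For readers unfamiliar with the term, the CTC greedy decode listed for zipformer-ctc can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name `ctc_greedy_decode`, the nested-`Vec` logits layout, and the `blank_id` parameter are assumptions made for the example.

```rust
/// Greedy CTC decode: take the best-scoring token for each frame,
/// collapse consecutive repeats, then drop blank tokens.
///
/// `logits` holds one row of per-token scores per frame.
/// `blank_id` is an assumption of this sketch (many CTC exports use 0).
fn ctc_greedy_decode(logits: &[Vec<f32>], blank_id: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev: Option<usize> = None;
    for frame in logits {
        // Argmax over the vocabulary axis for this frame.
        let best = frame
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i);
        // Skip empty frames without touching the repeat tracker.
        let Some(best) = best else { continue };
        // Emit only when the token is not blank and differs from the
        // previous frame's token.
        if best != blank_id && prev != Some(best) {
            out.push(best);
        }
        prev = Some(best);
    }
    out
}
```

The returned ids would still need to be mapped through the model's symbol table to produce text; a blank frame between two identical tokens keeps both of them.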

Punctuation Model — Usage

ASR engines fall into two categories:

Already punctuated (skip punct model): Whisper, SenseVoice
Raw text, no punctuation (need punct model): Zipformer, Paraformer, GigaAM

Recommended pattern — auto-detect and apply:

use std::path::Path;
use transcribe_rs::punct::PunctModel;

// 1. Transcribe
let result = engine.transcribe(&audio, &TranscribeOptions::default())?;

// 2. Check if output already has punctuation (full-width CJK or ASCII)
let has_punct = result.text.chars().any(|c| matches!(c,
    ',' | '。' | '?' | '!' | ',' | '.' | '?' | '!'
));

// 3. Apply punct model only when needed
if !has_punct && !result.text.is_empty() {
    let mut punct = PunctModel::new(Path::new("models/punct-model/"))?;
    let punctuated = punct.add_punctuation(&result.text);
}
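
The detection step can also be pulled into a small helper that is testable without loading any model. The sketch below is only an illustration: `has_punctuation` and `needs_punct_model` are hypothetical names, not part of transcribe_rs, and trimming whitespace before the emptiness check is an added assumption.

```rust
/// Returns true if `text` already contains sentence punctuation,
/// either full-width (CJK) or ASCII.
fn has_punctuation(text: &str) -> bool {
    text.chars().any(|c| {
        matches!(
            c,
            ',' | '。' | '?' | '!' | ';' | ',' | '.' | '?' | '!' | ';'
        )
    })
}

/// Decides whether a transcript should be passed to a punctuation model.
/// Whitespace-only transcripts are treated as empty (an assumption of
/// this sketch).
fn needs_punct_model(text: &str) -> bool {
    !text.trim().is_empty() && !has_punctuation(text)
}
```

One limitation of this heuristic: a transcript containing a decimal such as "3.5" counts as already punctuated, so callers that know which engine produced the text may prefer to decide per engine instead.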

GPU Acceleration

use transcribe_rs::accel;
use transcribe_rs::whisper_cpp::gpu::list_gpu_devices;

// Set accelerator preference
accel::set_whisper_accelerator(accel::WhisperAccelerator::Auto);
accel::set_ort_accelerator(accel::OrtAccelerator::Auto);

// Enumerate and select GPU
let devices = list_gpu_devices();
if !devices.is_empty() {
    accel::set_whisper_gpu_device(devices[0].id);
}

// Query available accelerators
let available = accel::OrtAccelerator::available(); // ["auto", "cpu", "cuda", ...]

Key Implementation Details

  • Auto-detect model files: Handles varying naming conventions (encoder-epoch-34-avg-19.int8.onnx, encoder.int8.onnx, etc.)
  • Auto-detect token encoding: BBPE vs standard BPE, detected via bbpe.model file presence
  • Mixed quantization: int8 encoder + fp32 decoder handled transparently
  • Unified API: All engines implement the SpeechModel trait
  • External post-processing: Punctuation is caller-controlled, not embedded in engines
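
As a rough illustration of the first three points, file discovery over a model directory listing could look like the sketch below. It is not the PR's implementation: `pick_model_file` and `uses_bbpe` are hypothetical helpers, and the prefix/suffix matching rule is an assumption.

```rust
/// Picks the ONNX file for one component (e.g. "encoder") from a list of
/// file names, tolerating names like `encoder-epoch-34-avg-19.int8.onnx`.
/// If `prefer_int8` is set, an `.int8.onnx` file wins when one exists;
/// otherwise the function falls back to whatever matching file is there,
/// which is how an int8 encoder can be paired with an fp32 decoder.
fn pick_model_file<'a>(
    names: &[&'a str],
    component: &str,
    prefer_int8: bool,
) -> Option<&'a str> {
    let mut candidates: Vec<&'a str> = names
        .iter()
        .copied()
        .filter(|n| n.starts_with(component) && n.ends_with(".onnx"))
        .collect();
    // Sort so the choice is deterministic when several files match.
    candidates.sort_unstable();
    let preferred = candidates
        .iter()
        .copied()
        .find(|n| n.ends_with(".int8.onnx") == prefer_int8);
    preferred.or_else(|| candidates.first().copied())
}

/// BBPE is assumed when a `bbpe.model` file sits next to the ONNX files.
fn uses_bbpe(names: &[&str]) -> bool {
    names.iter().any(|n| *n == "bbpe.model")
}
```

Working on a list of names rather than on the filesystem keeps the selection rule easy to unit test; a caller would fill the list from `std::fs::read_dir`.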

Tested Models

| Model | Language | Engine | Status |
|---|---|---|---|
| sherpa-onnx-paraformer-zh-2025-10-07 | Chinese | Paraformer | |
| sherpa-onnx-zipformer-ctc-small-zh-int8 | Chinese | CTC | |
| sherpa-onnx-zipformer-zh-en-2023-11-22 | Chinese+English | Transducer | |
| sherpa-onnx-zipformer-vi-30M-int8 | Vietnamese | Transducer | |
| sherpa-onnx-zipformer-ru-int8 | Russian | Transducer | |
| sherpa-onnx-zipformer-korean-2024-06-24 | Korean | Transducer | |
| punct-ct-transformer-zh-en-int8 | Chinese+English | Punct | |

Notes & Caveats

  • Offline only: ZipformerTransducerModel, ZipformerCtcModel, ParaformerModel run full-audio offline inference. Streaming model files may load but are not properly supported — use offline variants only.
  • Punct model is stateless: add_punctuation() processes text in 20-token windows with 2-token overlap. For realtime preview, callers should manage their own caching/anchoring strategy externally.
  • CT-Transformer int8 recommended: 62MB, ~50ms per sentence. Full-precision (266MB) is marginally better but 3x slower.
  • Punct I/O types: Input must be int32 tensors. Output is float32 logits [batch, seq_len, 6] — the library handles argmax internally.
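
To make the windowing in the second point concrete, here is a minimal sketch of how 20-token windows with a 2-token overlap could be laid out. `window_ranges` is a hypothetical helper, and how the overlapping predictions are merged inside `add_punctuation()` is not shown here.

```rust
/// Splits `len` tokens into half-open `(start, end)` windows of at most
/// `window` tokens, where each window starts `window - overlap` tokens
/// after the previous one.
fn window_ranges(len: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > overlap, "window must be larger than overlap");
    let stride = window - overlap;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + window).min(len);
        ranges.push((start, end));
        // Stop once a window reaches the end of the input.
        if end == len {
            break;
        }
        start += stride;
    }
    ranges
}
```

With the values from the note, `window_ranges(45, 20, 2)` yields the ranges (0, 20), (18, 38) and (36, 45), so each window re-reads the last two tokens of the previous one.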

Test plan

  • cargo check --features all — clean
  • Transcription verified across 6+ models in 5 languages
  • Punctuation verified (Chinese + English, int8 and full models)
  • GPU enumeration tested on macOS (Metal)

🤖 Generated with Claude Code

@cjpais

cjpais commented Feb 27, 2026

Owner

Let's gooooo! Thank you, I will test this and try to pull it in soon.

@zhuzhuyule

Author

I just casually whipped up a table:
https://www.myvibe.so/zhangfan/sherpa-onnx-asr-models

@zhuzhuyule

Author

Closing to recreate from a dedicated feature branch (this PR's head was fork/main which now contains unrelated changes).

@zhuzhuyule zhuzhuyule closed this Feb 28, 2026
@zhuzhuyule zhuzhuyule reopened this Feb 28, 2026
@zhuzhuyule zhuzhuyule changed the title from "feat: add Paraformer engine with punctuation support" to "feat: add Paraformer, Zipformer CTC & Transducer engines with punctuation support" Feb 28, 2026
@cjpais

cjpais commented Mar 1, 2026

Owner

Okay, when I'm looking at this, it's becoming increasingly obvious we need to significantly modify the codebase. I am going to do this, and then let's get these in. If you don't mind waiting and rebasing on top of this, it would be great.

Basically I want to separate things out into engines

  • whisper.cpp
  • whisperfile
  • onnx
  • mlx?
  • ggml?

or similar, so we can then implement models per engine as well. i think this will be a much better way forward, but will require some better documentation. I think we can get something going like auto model porting as well from a given base implementation (usually hf transformers). We can potentially try and support the transformers implementations too, but largely I'm not super focused on that for the moment.

@zhuzhuyule

Author

Before discovering your project, I actually used the sherpa-rs-sys crate, which worked exceptionally well. It not only supported streaming transcription but also allowed the integration of a wider range of models. The only drawback was the third-party code signing issue we encountered during project installation—this arose because we utilized third-party dynamic libraries in the project.

You may want to try out the forked branch I built based on your 0.6.8 version:

  • I initially added support for cloud-based speech models due to the low hardware specifications of my device at the time.
  • Later on, I further implemented streaming models for real-time transcription.
  • Additional features include a demo showcase, transcription history analysis, and more.
  • I also fully refactored the UI/UX of the application.

I originally intended to submit a PR for these changes, but ultimately abandoned the idea due to the extensive scope of the modifications.

You can check out this branch here: https://github.com/zhuzhuyule/Votype/tree/votype


@zhuzhuyule

Author

> Okay, when I'm looking at this, it's becoming increasingly obvious we need to significantly modify the codebase. I am going to do this, and then let's get these in. If you don't mind waiting and rebasing on top of this, it would be great.
>
> Basically I want to separate things out into engines
>
>   • whisper.cpp
>   • whisperfile
>     ...

Thanks for the plan! I totally support the idea of separating engines — it makes the architecture much cleaner.

One thing I'd like to share: I chose sherpa-onnx specifically because it already supports a huge variety of languages and models (100+ languages with Paraformer/Zipformer). While it may not match the quality of the latest SOTA models, it's practically "good enough" for most use cases and covers far more languages than whisper.cpp alone.

This makes me wonder: should broad language coverage be a high-priority goal for this project? If so, onnx (via sherpa-onnx) might deserve some extra attention in the new engine architecture.

The main downside of sherpa-onnx is that sherpa-rs-sys can be a bit tricky to install. Do you have any thoughts on how to handle that in the new setup? Or maybe there's a cleaner way to package the sherpa dependencies?

@cjpais

cjpais commented Mar 2, 2026

Owner

Largely I love sherpa-onnx as well, and have used it in other projects. I mostly didn't pull it in due to dep issues I ran into when trying to use it in Handy. And at this point AI can more or less reimplement inference engines based on another reference. Basically it's possible to automate porting from transformers, or sherpa-onnx more or less, and at the moment that seems to be a better solution to me. Just because of all these dep nightmares. I would rather contain the dependencies to a known tree and build from that.

Broad language coverage is a goal for sure.

Perhaps the bindings to sherpa are just not very good and there's a better way to build/distribute them. I've just not taken a deep look yet. But since most everything is onnx anyway, porting is fairly straightforward, and I honestly prefer it this way. There's probably fairly low-hanging fruit in terms of automating this pipeline too.

Point at a transformers model and output:

  • onnx
  • mlx
  • ggml
  • burn
  • candle
  • etc....

Not just in terms of porting weights to the respective formats, but also automating the actual inference code generation too into a variety of languages. You could imagine this being done for Rust (like here), C/C++, Golang, Swift, JS/TS, etc.

And have the logits verified. I think this is reasonable enough to do, and is a direction I'm thinking a lot about. If you want to help, would love to discuss further

@cjpais

cjpais commented Mar 13, 2026

Owner

@zhuzhuyule for what it's worth I did the base level refactor. would love if you want to move this code into the new format. should be fairly straightforward I imagine.

@kakapt

kakapt commented Mar 19, 2026


The zipformer models support my native language, would love this feature to be merged!

@csukuangfj


I suggest that you use
https://crates.io/crates/sherpa-onnx

You can find doc at
https://k2-fsa.github.io/sherpa/onnx/rust-api/install.html

and examples at
https://github.com/k2-fsa/sherpa-onnx/tree/master/rust-api-examples

@cjpais

cjpais commented Mar 22, 2026

Owner

I think this is also probably the way forward @csukuangfj just need to test it plays nicely with the Handy CI/CD at this point. That almost certainly was one of the original blockers

Thanks for all the work you and your team do, sherpa-onnx is wonderful. It was very fun playing with it on some RK based boards recently, and using the NPU :)

For what it's worth if we add sherpa-onnx, which we probably should, it should be a new engine type. We may still choose to implement the ONNX ourselves, but being able to use the upstream would be much nicer on average. Also quite frankly more trustworthy than our implementations until we get better validation/verification of our own implementations

zhuzhuyule and others added 11 commits March 29, 2026 18:28
Port compute_fbank_kaldi from backup branch as compute_kaldi_fbank with
KaldiFbankConfig (sample_rate u32, Povey window, DC removal, natural log,
negative high_freq Kaldi convention). Registered in features::mod.rs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Non-autoregressive ASR model with custom fbank (Hamming/dB scale),
LFR stacking, mean-only CMVN, and @@-subword symbol table decoding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements ZipformerCtcModel with SpeechModel trait, Kaldi fbank feature
extraction, and CTC greedy decode using BbpeSymbolTable. Supports both
standard model.onnx naming and sherpa-onnx directory-scan fallback.
Rejects streaming models that contain cached_* inputs at load time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three-session RNN-T architecture (encoder, decoder, joiner) with greedy
search decoding. Auto-detects I/O names and model file naming conventions.
Rejects streaming models at load time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements PunctModel backed by CT-Transformer ONNX, with sliding-window
inference (20-token chunks, 2-token overlap) and smart CJK/ASCII punctuation
selection. Adds independent `punct` feature gate and updates `all`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add example binaries and integration tests for ParaformerModel,
ZipformerCtcModel, and ZipformerTransducerModel following the existing
gigaam/sense_voice patterns. Tests skip gracefully when model files
are absent; examples accept positional args and --int8 flag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix loop variable indexing in kaldi_fbank.rs
- Apply cargo fmt to all new files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CT-Transformer punctuation model expects int32 input tensors, but
the code was casting token IDs from i32 to i64. Use i32 directly for
both input_array and length_array. Also make output extraction flexible
(try i64 first, fall back to i32) since different model versions may
output different types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CT-Transformer punct model outputs float32 logits with shape
[batch, seq_len, num_classes=6], not pre-argmaxed integers. Apply
argmax along the last axis to get punctuation class IDs. Fall back
to i64/i32 extraction for models that output pre-argmaxed values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce log noise during normal operation:
- ONNX session model input/output tensor info → DEBUG
- BBPE encoding detection → DEBUG
- Punct model token count and input names → DEBUG
- Zipformer model file discovery → DEBUG

Error and warning logs (model load failures, inference errors) remain
at WARN/ERROR level for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zhuzhuyule

Author

Update: v0.3.5 — Punct Model Fixes, GPU Accel, Log Cleanup

What changed since initial PR

  1. Punct model I/O fix: Input tensors now use int32 (was incorrectly cast to int64). Output extraction handles float32 logits via argmax (the model outputs [batch, seq_len, 6] logits, not pre-argmaxed integers).

  2. GPU acceleration module (accel): set_whisper_accelerator(), set_ort_accelerator(), set_whisper_gpu_device(), and list_gpu_devices() for runtime GPU selection.

  3. Log noise reduction: ONNX session tensor info, BBPE detection, punct token loading, and model file discovery logs downgraded from INFO → DEBUG.
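
The argmax step from item 1 can be sketched as below for a single batch entry. This is an illustration under assumptions, not the code in punct.rs: `argmax_classes` is a hypothetical name and the logits are taken as a flat row-major `[seq_len, num_classes]` buffer.

```rust
/// Row-wise argmax over a flat, row-major `[seq_len, num_classes]`
/// buffer of float32 logits. Returns one class id per token.
fn argmax_classes(logits: &[f32], num_classes: usize) -> Vec<usize> {
    assert!(num_classes > 0 && logits.len() % num_classes == 0);
    logits
        .chunks_exact(num_classes)
        .map(|row| {
            row.iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i)
                // Rows are never empty because num_classes > 0.
                .unwrap_or(0)
        })
        .collect()
}
```

For the CT-Transformer model described here, `num_classes` would be 6, and each resulting id selects the punctuation class for the corresponding token.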

Recommended Usage Pattern

Engines that output punctuated text (no external punct needed):

  • Whisper (all variants)
  • SenseVoice

Engines that output raw text without punctuation (need external punct model):

  • Zipformer Transducer
  • Zipformer CTC
  • Paraformer
  • GigaAM

Auto-detect + apply pattern:

let result = engine.transcribe(&audio, &TranscribeOptions::default())?;

// Check if output already has punctuation (full-width CJK or ASCII)
let has_punct = result.text.chars().any(|c| matches!(c,
    ',' | '。' | '?' | '!' | ';' | ',' | '.' | '?' | '!' | ';'
));

if !has_punct && !result.text.is_empty() {
    let mut punct = PunctModel::new(Path::new("models/punct-model/"))?;
    let punctuated = punct.add_punctuation(&result.text);
    // Use punctuated text
}

Notes & Caveats

  • Streaming models not supported: Current ZipformerTransducerModel / ZipformerCtcModel / ParaformerModel load the full audio and run offline inference. Streaming model files (filename containing streaming) will load but may produce incorrect results — callers should use offline model variants only.
  • Punct model is stateless per-call: add_punctuation() processes text in 20-token windows with 2-token overlap. For realtime preview, callers should manage their own caching/anchoring strategy.
  • CT-Transformer int8 model recommended: 62MB, fast (~50ms for typical sentences). Full-precision model (266MB) gives marginally better accuracy but 3x slower.

@zhuzhuyule zhuzhuyule changed the title from "feat: add Paraformer, Zipformer CTC & Transducer engines with punctuation support" to "feat: add Paraformer, Zipformer, Punctuation, GPU acceleration (v0.3.5)" Mar 30, 2026
- Remove llm_postprocess module (not yet ported, broke example build)
- Remove stale docs and plan files
- Fix clippy skip(0) warning in punct.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 participants